A String Similarity Measure Based on Orthographic and Phonetic Similarity for Spelling Correction

Authors

  • Do-Gil Lee
  • Ilhwan Kim
  • Seok Kee Lee
Abstract

The most commonly used string similarity measure for spelling correction is minimum edit distance (MED), which is based solely on the orthographic similarity between two strings and therefore ignores how they are pronounced. To overcome this shortcoming, this paper presents a more sophisticated similarity measure that considers both the orthographic and the phonetic similarity between two strings. To demonstrate the effectiveness of the proposed measure, we implement and test a spelling correction system and apply it to two languages, English and Korean. We investigate the useful features for each language through various experiments and achieve 10-best accuracies of 95.2 for English and 97.4 for Korean.
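To make the contrast in the abstract concrete, the following is a minimal sketch of a purely orthographic score (normalized MED) next to a score that also compares phonetic keys. It is not the measure proposed in the paper, whose features are not given here. The Soundex-like `phonetic_key`, the equal `weight`, and all function names are assumptions introduced for illustration, and the toy key only makes sense for English-like spelling.

```python
# Illustrative sketch only: NOT the paper's measure. The phonetic key and the
# linear weighting below are assumptions made for this example.

def edit_distance(a: str, b: str) -> int:
    """Minimum edit distance (Levenshtein) with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Map edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Hypothetical consonant classes, loosely modeled on Soundex.
_PHONETIC_CLASSES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def phonetic_key(word: str) -> str:
    """Toy key: consonant class codes, other letters dropped,
    identical neighbouring codes collapsed."""
    key = []
    for ch in word.lower():
        code = _PHONETIC_CLASSES.get(ch, "")
        if code and (not key or key[-1] != code):
            key.append(code)
    return "".join(key)


def combined_similarity(a: str, b: str, weight: float = 0.5) -> float:
    """Weighted mix of orthographic and (toy) phonetic similarity."""
    ortho = normalized_similarity(a.lower(), b.lower())
    phono = normalized_similarity(phonetic_key(a), phonetic_key(b))
    return weight * ortho + (1.0 - weight) * phono


def rank_candidates(misspelling: str, lexicon: list[str], n: int = 10) -> list[str]:
    """Return the n lexicon entries most similar to the misspelling."""
    return sorted(lexicon,
                  key=lambda w: combined_similarity(misspelling, w),
                  reverse=True)[:n]
```

Under this sketch, "fone" and "phone" share a phonetic key, so their combined score is higher than their MED-based score alone. An n-best accuracy such as the 10-best figures quoted above would count a correction as successful when the intended word appears anywhere in the list returned by `rank_candidates(..., n=10)`.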

Similar resources

Alignment-Based Discriminative String Similarity

A character-based measure of similarity is an important component of many natural language processing systems, including approaches to transliteration, coreference, word alignment, spelling correction, and the identification of cognates in related vocabularies. We propose an alignment-based discriminative framework for string similarity. We gather features from substring pairs consistent with a...

Finding Approximate Matches in Large Lexicons

Approximate string matching is used for spelling correction and personal name matching. In this paper we show how to use string matching techniques in conjunction with lexicon indexes to find approximate matches in a large lexicon. We test several lexicon indexing techniques, including n-grams and permuted lexicons, and several string matching techniques, including string similarity measures an...

An Ensemble Method for Spelling Correction in Consumer Health Questions

Orthographic and grammatical errors are a common feature of informal texts written by lay people. Health-related questions asked by consumers are a case in point. Automatic interpretation of consumer health questions is hampered by such errors. In this paper, we propose a method that combines techniques based on edit distance and frequency counts with a contextual similarity-based method for de...

Identification of Confusable Drug Names: A New Approach and Evaluation Methodology

This paper addresses the mitigation of medical errors due to the confusion of sound-alike and look-alike drug names. Our approach involves application of two new methods: one based on orthographic similarity (“look-alike”) and the other based on phonetic similarity (“sound-alike”). We present a new recall-based evaluation methodology for determining the effectiveness of different similarity meas...

Fast Phonetic Similarity Search over Large Repositories

Today there is a large amount of unstructured data produced by information systems from different domains. These sources may be analyzed for different purposes. Existing approaches use string similarity methods to search for valid words within a text, with a supporting dictionary. However, they have two main drawbacks. First, they are not rich enough to encode phonetic information to assist the...

Publication date: 2012